Starbucks Capstone Challenge¶

Introduction¶

This data set contains simulated data that mimics customer behavior on the Starbucks rewards mobile app. Once every few days, Starbucks sends out an offer to users of the mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one get one free). Some users might not receive any offer during certain weeks.

Not all users receive the same offer, and that is the challenge to solve with this data set.

Your task is to combine transaction, demographic and offer data to determine which demographic groups respond best to which offer type. This data set is a simplified version of the real Starbucks app because the underlying simulator only has one product whereas Starbucks actually sells dozens of products.

Every offer has a validity period before the offer expires. As an example, a BOGO offer might be valid for only 5 days. You'll see in the data set that informational offers have a validity period even though these ads are merely providing information about a product; for example, if an informational offer has 7 days of validity, you can assume the customer is feeling the influence of the offer for 7 days after receiving the advertisement.

You'll be given transactional data showing user purchases made on the app including the timestamp of purchase and the amount of money spent on a purchase. This transactional data also has a record for each offer that a user receives as well as a record for when a user actually views the offer. There are also records for when a user completes an offer.

Keep in mind as well that someone using the app might make a purchase through the app without having received an offer or seen an offer.

Example¶

To give an example, a user could receive a discount offer buy 10 dollars get 2 off on Monday. The offer is valid for 10 days from receipt. If the customer accumulates at least 10 dollars in purchases during the validity period, the customer completes the offer.

However, there are a few things to watch out for in this data set. Customers do not opt into the offers that they receive; in other words, a user can receive an offer, never actually view the offer, and still complete the offer. For example, a user might receive the "buy 10 dollars get 2 dollars off offer", but the user never opens the offer during the 10 day validity period. The customer spends 15 dollars during those ten days. There will be an offer completion record in the data set; however, the customer was not influenced by the offer because the customer never viewed the offer.

Cleaning¶

This makes data cleaning especially important and tricky.

You'll also want to take into account that some demographic groups will make purchases even if they don't receive an offer. From a business perspective, if a customer is going to make a 10 dollar purchase without an offer anyway, you wouldn't want to send a buy 10 dollars get 2 dollars off offer. You'll want to try to assess what a certain demographic group will buy when not receiving any offers.

Final Advice¶

Because this is a capstone project, you are free to analyze the data any way you see fit. For example, you could build a machine learning model that predicts how much someone will spend based on demographics and offer type. Or you could build a model that predicts whether or not someone will respond to an offer. Or, you don't need to build a machine learning model at all. You could develop a set of heuristics that determine what offer you should send to each customer (i.e., 75 percent of women customers who were 35 years old responded to offer A vs 40 percent from the same demographic to offer B, so send offer A).

Data Sets¶

The data is contained in three files:

  • portfolio.json - containing offer ids and meta data about each offer (duration, type, etc.)
  • profile.json - demographic data for each customer
  • transcript.json - records for transactions, offers received, offers viewed, and offers completed

Here is the schema and explanation of each variable in the files:

portfolio.json

  • id (string) - offer id
  • offer_type (string) - type of offer ie BOGO, discount, informational
  • difficulty (int) - minimum required spend to complete an offer
  • reward (int) - reward given for completing an offer
  • duration (int) - time for offer to be open, in days
  • channels (list of strings)

profile.json

  • age (int) - age of the customer
  • became_member_on (int) - date when customer created an app account
  • gender (str) - gender of the customer (note some entries contain 'O' for other rather than M or F)
  • id (str) - customer id
  • income (float) - customer's income

transcript.json

  • event (str) - record description (ie transaction, offer received, offer viewed, etc.)
  • person (str) - customer id
  • time (int) - time in hours since start of test. The data begins at time t=0
  • value - (dict of strings) - either an offer id or transaction amount depending on the record

Note: If you are using the workspace, you will need to go to the terminal and run the command conda update pandas before reading in the files. This is because the version of pandas in the workspace cannot read in the transcript.json file correctly, but the newest version of pandas can. You can access the termnal from the orange icon in the top left of this notebook.

You can see how to access the terminal and how the install works using the two images below. First you need to access the terminal:

Then you will want to run the above command:

Finally, when you enter back into the notebook (use the jupyter icon again), you should be able to run the below cell without any errors.

In [1]:
import pandas as pd
import numpy as np
import math
import json
%matplotlib inline

import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier 
from sklearn.metrics import fbeta_score, accuracy_score

# read in the json files
portfolio = pd.read_json('data/portfolio.json', orient='records', lines=True)
profile = pd.read_json('data/profile.json', orient='records', lines=True)
transcript = pd.read_json('data/transcript.json', orient='records', lines=True)

Data Understanding¶

Lets look at 3 different datasets.

Portfolio dataset¶

In [2]:
portfolio.head()
Out[2]:
reward channels difficulty duration offer_type id
0 10 [email, mobile, social] 10 7 bogo ae264e3637204a6fb9bb56bc8210ddfd
1 10 [web, email, mobile, social] 10 5 bogo 4d5c57ea9a6940dd891ad53e9dbe8da0
2 0 [web, email, mobile] 0 4 informational 3f207df678b143eea3cee63160fa8bed
3 5 [web, email, mobile] 5 7 bogo 9b98b8c7a33c4b65b9aebfe6a799e6d9
4 5 [web, email] 20 10 discount 0b1e1539f2cc45b7b9fa7c272da2e1d7
In [3]:
portfolio.describe()
Out[3]:
reward difficulty duration
count 10.000000 10.000000 10.000000
mean 4.200000 7.700000 6.500000
std 3.583915 5.831905 2.321398
min 0.000000 0.000000 3.000000
25% 2.000000 5.000000 5.000000
50% 4.000000 8.500000 7.000000
75% 5.000000 10.000000 7.000000
max 10.000000 20.000000 10.000000
In [4]:
portfolio.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   reward      10 non-null     int64 
 1   channels    10 non-null     object
 2   difficulty  10 non-null     int64 
 3   duration    10 non-null     int64 
 4   offer_type  10 non-null     object
 5   id          10 non-null     object
dtypes: int64(3), object(3)
memory usage: 608.0+ bytes
In [5]:
portfolio.shape
Out[5]:
(10, 6)

Profile dataset¶

In [6]:
profile.head()
Out[6]:
gender age id became_member_on income
0 None 118 68be06ca386d4c31939f3a4f0e3dd783 20170212 NaN
1 F 55 0610b486422d4921ae7d2bf64640c50b 20170715 112000.0
2 None 118 38fe809add3b4fcf9315a9694bb96ff5 20180712 NaN
3 F 75 78afa995795e4d85b5d9ceeca43f5fef 20170509 100000.0
4 None 118 a03223e636434f42ac4c3df47e8bac43 20170804 NaN
In [7]:
profile.describe()
Out[7]:
age became_member_on income
count 17000.000000 1.700000e+04 14825.000000
mean 62.531412 2.016703e+07 65404.991568
std 26.738580 1.167750e+04 21598.299410
min 18.000000 2.013073e+07 30000.000000
25% 45.000000 2.016053e+07 49000.000000
50% 58.000000 2.017080e+07 64000.000000
75% 73.000000 2.017123e+07 80000.000000
max 118.000000 2.018073e+07 120000.000000
In [8]:
profile.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17000 entries, 0 to 16999
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   gender            14825 non-null  object 
 1   age               17000 non-null  int64  
 2   id                17000 non-null  object 
 3   became_member_on  17000 non-null  int64  
 4   income            14825 non-null  float64
dtypes: float64(1), int64(2), object(2)
memory usage: 664.2+ KB
In [9]:
profile.shape
Out[9]:
(17000, 5)
In [10]:
profile[profile['gender'].isnull()].count()
Out[10]:
gender                 0
age                 2175
id                  2175
became_member_on    2175
income                 0
dtype: int64
In [11]:
profile[profile['income'].isnull()].count()
Out[11]:
gender                 0
age                 2175
id                  2175
became_member_on    2175
income                 0
dtype: int64
In [12]:
profile.duplicated().sum()
Out[12]:
0
In [13]:
profile['gender'].unique()
Out[13]:
array([None, 'F', 'M', 'O'], dtype=object)
We can say that there is 2175 rows have null values for gender and income columns, and no duplicate rows.¶

Transcript dataset¶

In [14]:
transcript.head()
Out[14]:
person event value time
0 78afa995795e4d85b5d9ceeca43f5fef offer received {'offer id': '9b98b8c7a33c4b65b9aebfe6a799e6d9'} 0
1 a03223e636434f42ac4c3df47e8bac43 offer received {'offer id': '0b1e1539f2cc45b7b9fa7c272da2e1d7'} 0
2 e2127556f4f64592b11af22de27a7932 offer received {'offer id': '2906b810c7d4411798c6938adc9daaa5'} 0
3 8ec6ce2a7e7949b1bf142def7d0e0586 offer received {'offer id': 'fafdcd668e3743c1bb461111dcafc2a4'} 0
4 68617ca6246f4fbc85e91a2a49552598 offer received {'offer id': '4d5c57ea9a6940dd891ad53e9dbe8da0'} 0
In [15]:
transcript.describe()
Out[15]:
time
count 306534.000000
mean 366.382940
std 200.326314
min 0.000000
25% 186.000000
50% 408.000000
75% 528.000000
max 714.000000
In [16]:
transcript.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 306534 entries, 0 to 306533
Data columns (total 4 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   person  306534 non-null  object
 1   event   306534 non-null  object
 2   value   306534 non-null  object
 3   time    306534 non-null  int64 
dtypes: int64(1), object(3)
memory usage: 9.4+ MB
In [17]:
transcript.shape
Out[17]:
(306534, 4)
In [18]:
transcript.isnull().sum()
Out[18]:
person    0
event     0
value     0
time      0
dtype: int64
In [19]:
transcript['event'].unique()
Out[19]:
array(['offer received', 'offer viewed', 'transaction', 'offer completed'],
      dtype=object)

Data Cleaning¶

Lets clean our datasets.

In [20]:
# check the outliers of age column
fig = px.box(profile, y="age")
fig.show()
In [21]:
# let's check the income of age 118

check_age = profile[profile['age']== 118]
check_age.head()
Out[21]:
gender age id became_member_on income
0 None 118 68be06ca386d4c31939f3a4f0e3dd783 20170212 NaN
2 None 118 38fe809add3b4fcf9315a9694bb96ff5 20180712 NaN
4 None 118 a03223e636434f42ac4c3df47e8bac43 20170804 NaN
6 None 118 8ec6ce2a7e7949b1bf142def7d0e0586 20170925 NaN
7 None 118 68617ca6246f4fbc85e91a2a49552598 20171002 NaN
In [22]:
check_age.describe()
Out[22]:
age became_member_on income
count 2175.0 2.175000e+03 0.0
mean 118.0 2.016804e+07 NaN
std 0.0 1.009105e+04 NaN
min 118.0 2.013080e+07 NaN
25% 118.0 2.016070e+07 NaN
50% 118.0 2.017073e+07 NaN
75% 118.0 2.017123e+07 NaN
max 118.0 2.018073e+07 NaN
In [23]:
# As result we found that age 118 is a dummy data. there are 2175 rows in gender and income have no info, so we will drop them. 

profile = profile[profile['age']<118].reset_index(drop=True)
In [24]:
profile.describe()
Out[24]:
age became_member_on income
count 14825.000000 1.482500e+04 14825.000000
mean 54.393524 2.016689e+07 65404.991568
std 17.383705 1.188565e+04 21598.299410
min 18.000000 2.013073e+07 30000.000000
25% 42.000000 2.016052e+07 49000.000000
50% 55.000000 2.017080e+07 64000.000000
75% 66.000000 2.017123e+07 80000.000000
max 101.000000 2.018073e+07 120000.000000
In [25]:
profile.head()
Out[25]:
gender age id became_member_on income
0 F 55 0610b486422d4921ae7d2bf64640c50b 20170715 112000.0
1 F 75 78afa995795e4d85b5d9ceeca43f5fef 20170509 100000.0
2 M 68 e2127556f4f64592b11af22de27a7932 20180426 70000.0
3 M 65 389bc3fa690240e798340f5a15918d5c 20180209 53000.0
4 M 58 2eeac8d8feae4a8cad5a6af0499a211d 20171111 51000.0
In [26]:
# convert (became_member_on) to date
profile.became_member_on = pd.to_datetime(profile.became_member_on, format = '%Y%m%d')
In [27]:
profile.head()
Out[27]:
gender age id became_member_on income
0 F 55 0610b486422d4921ae7d2bf64640c50b 2017-07-15 112000.0
1 F 75 78afa995795e4d85b5d9ceeca43f5fef 2017-05-09 100000.0
2 M 68 e2127556f4f64592b11af22de27a7932 2018-04-26 70000.0
3 M 65 389bc3fa690240e798340f5a15918d5c 2018-02-09 53000.0
4 M 58 2eeac8d8feae4a8cad5a6af0499a211d 2017-11-11 51000.0
In [28]:
# keep only the year
profile['became_member_on'] = pd.Series([int(str(x)[:4]) for x in profile['became_member_on']])
In [29]:
profile.head()
Out[29]:
gender age id became_member_on income
0 F 55 0610b486422d4921ae7d2bf64640c50b 2017 112000.0
1 F 75 78afa995795e4d85b5d9ceeca43f5fef 2017 100000.0
2 M 68 e2127556f4f64592b11af22de27a7932 2018 70000.0
3 M 65 389bc3fa690240e798340f5a15918d5c 2018 53000.0
4 M 58 2eeac8d8feae4a8cad5a6af0499a211d 2017 51000.0
In [30]:
# rename columns
profile = profile.rename(columns={'id':'customer_id'})
In [31]:
profile.head()
Out[31]:
gender age customer_id became_member_on income
0 F 55 0610b486422d4921ae7d2bf64640c50b 2017 112000.0
1 F 75 78afa995795e4d85b5d9ceeca43f5fef 2017 100000.0
2 M 68 e2127556f4f64592b11af22de27a7932 2018 70000.0
3 M 65 389bc3fa690240e798340f5a15918d5c 2018 53000.0
4 M 58 2eeac8d8feae4a8cad5a6af0499a211d 2017 51000.0
In [32]:
# convert day to hour

portfolio['duration']=portfolio['duration'] * 24
portfolio.head()
Out[32]:
reward channels difficulty duration offer_type id
0 10 [email, mobile, social] 10 168 bogo ae264e3637204a6fb9bb56bc8210ddfd
1 10 [web, email, mobile, social] 10 120 bogo 4d5c57ea9a6940dd891ad53e9dbe8da0
2 0 [web, email, mobile] 0 96 informational 3f207df678b143eea3cee63160fa8bed
3 5 [web, email, mobile] 5 168 bogo 9b98b8c7a33c4b65b9aebfe6a799e6d9
4 5 [web, email] 20 240 discount 0b1e1539f2cc45b7b9fa7c272da2e1d7
In [33]:
# rename id column to offer_id
portfolio=portfolio.rename(columns={'id':'offer_id'})
In [34]:
# one hot encoding for channel column
portfolio['email']=portfolio['channels'].apply(lambda x: 1 if 'email' in x else 0)
portfolio['mobile']=portfolio['channels'].apply(lambda x: 1 if 'mobile' in x else 0)
portfolio['social']=portfolio['channels'].apply(lambda x: 1 if 'social' in x else 0)
portfolio['web']=portfolio['channels'].apply(lambda x: 1 if 'web' in x else 0)
In [35]:
portfolio.drop(['channels'], axis=1, inplace=True)
portfolio.head()
Out[35]:
reward difficulty duration offer_type offer_id email mobile social web
0 10 10 168 bogo ae264e3637204a6fb9bb56bc8210ddfd 1 1 1 0
1 10 10 120 bogo 4d5c57ea9a6940dd891ad53e9dbe8da0 1 1 1 1
2 0 0 96 informational 3f207df678b143eea3cee63160fa8bed 1 1 0 1
3 5 5 168 bogo 9b98b8c7a33c4b65b9aebfe6a799e6d9 1 1 0 1
4 5 20 240 discount 0b1e1539f2cc45b7b9fa7c272da2e1d7 1 0 0 1
In [36]:
transcript.head()
Out[36]:
person event value time
0 78afa995795e4d85b5d9ceeca43f5fef offer received {'offer id': '9b98b8c7a33c4b65b9aebfe6a799e6d9'} 0
1 a03223e636434f42ac4c3df47e8bac43 offer received {'offer id': '0b1e1539f2cc45b7b9fa7c272da2e1d7'} 0
2 e2127556f4f64592b11af22de27a7932 offer received {'offer id': '2906b810c7d4411798c6938adc9daaa5'} 0
3 8ec6ce2a7e7949b1bf142def7d0e0586 offer received {'offer id': 'fafdcd668e3743c1bb461111dcafc2a4'} 0
4 68617ca6246f4fbc85e91a2a49552598 offer received {'offer id': '4d5c57ea9a6940dd891ad53e9dbe8da0'} 0
In [37]:
# divide value column into 2 due to different info
transcript['offer_id'] = [[*i.values()][0]if [*i.keys()][0] in ['offer id','offer_id'] else None for i in transcript.value]
transcript['amount'] = [np.round([*i.values()][0], decimals=2)if [*i.keys()][0] == 'amount' else None for i in transcript.value]
In [38]:
# rename
transcript=transcript.rename(columns={'person':'customer_id'})
In [39]:
# drop value column
transcript=transcript.drop(columns={'value'})
transcript.head()
Out[39]:
customer_id event time offer_id amount
0 78afa995795e4d85b5d9ceeca43f5fef offer received 0 9b98b8c7a33c4b65b9aebfe6a799e6d9 NaN
1 a03223e636434f42ac4c3df47e8bac43 offer received 0 0b1e1539f2cc45b7b9fa7c272da2e1d7 NaN
2 e2127556f4f64592b11af22de27a7932 offer received 0 2906b810c7d4411798c6938adc9daaa5 NaN
3 8ec6ce2a7e7949b1bf142def7d0e0586 offer received 0 fafdcd668e3743c1bb461111dcafc2a4 NaN
4 68617ca6246f4fbc85e91a2a49552598 offer received 0 4d5c57ea9a6940dd891ad53e9dbe8da0 NaN
In [40]:
portfolio.head()
Out[40]:
reward difficulty duration offer_type offer_id email mobile social web
0 10 10 168 bogo ae264e3637204a6fb9bb56bc8210ddfd 1 1 1 0
1 10 10 120 bogo 4d5c57ea9a6940dd891ad53e9dbe8da0 1 1 1 1
2 0 0 96 informational 3f207df678b143eea3cee63160fa8bed 1 1 0 1
3 5 5 168 bogo 9b98b8c7a33c4b65b9aebfe6a799e6d9 1 1 0 1
4 5 20 240 discount 0b1e1539f2cc45b7b9fa7c272da2e1d7 1 0 0 1
In [41]:
profile.head()
Out[41]:
gender age customer_id became_member_on income
0 F 55 0610b486422d4921ae7d2bf64640c50b 2017 112000.0
1 F 75 78afa995795e4d85b5d9ceeca43f5fef 2017 100000.0
2 M 68 e2127556f4f64592b11af22de27a7932 2018 70000.0
3 M 65 389bc3fa690240e798340f5a15918d5c 2018 53000.0
4 M 58 2eeac8d8feae4a8cad5a6af0499a211d 2017 51000.0
In [42]:
# Merge 1 (merge profile with transcript)
df = pd.merge(profile,transcript, on='customer_id')
In [43]:
df.head()
Out[43]:
gender age customer_id became_member_on income event time offer_id amount
0 F 55 0610b486422d4921ae7d2bf64640c50b 2017 112000.0 transaction 18 None 21.51
1 F 55 0610b486422d4921ae7d2bf64640c50b 2017 112000.0 transaction 144 None 32.28
2 F 55 0610b486422d4921ae7d2bf64640c50b 2017 112000.0 offer received 408 9b98b8c7a33c4b65b9aebfe6a799e6d9 NaN
3 F 55 0610b486422d4921ae7d2bf64640c50b 2017 112000.0 offer received 504 3f207df678b143eea3cee63160fa8bed NaN
4 F 55 0610b486422d4921ae7d2bf64640c50b 2017 112000.0 transaction 528 None 23.22
In [44]:
# Renumber customer_id
customer_id_dict = df['customer_id'].unique()
customer_id_dict = pd.Series(customer_id_dict).to_dict()
customer_id_dict = dict([(v, k) for k, v in customer_id_dict.items()]) 
In [45]:
# Renumber offer_id
offer_id_dict = df['offer_id'].unique()
offer_id_dict = pd.Series(offer_id_dict).to_dict()
offer_id_dict = dict([(v, k) for k, v in offer_id_dict.items()]) 
In [46]:
offer_id_dict
Out[46]:
{None: 0,
 '9b98b8c7a33c4b65b9aebfe6a799e6d9': 1,
 '3f207df678b143eea3cee63160fa8bed': 2,
 '5a8bc65990b245e5a138643cd4eb9837': 3,
 'ae264e3637204a6fb9bb56bc8210ddfd': 4,
 'f19421c1d4aa40978ebb69ca19b0e20d': 5,
 '2906b810c7d4411798c6938adc9daaa5': 6,
 'fafdcd668e3743c1bb461111dcafc2a4': 7,
 '2298d6c36e964ae4a3e7e9706d1fb8c2': 8,
 '0b1e1539f2cc45b7b9fa7c272da2e1d7': 9,
 '4d5c57ea9a6940dd891ad53e9dbe8da0': 10}
In [47]:
# Mapping the new ids to original df
df['customer_id'] = df['customer_id'].map(customer_id_dict)
df['offer_id'] = df['offer_id'].map(offer_id_dict)
In [48]:
# Mapping the new offer id to original portfolio dataset for future merge
portfolio['offer_id'] = portfolio['offer_id'].map(offer_id_dict)
In [49]:
df.head()
Out[49]:
gender age customer_id became_member_on income event time offer_id amount
0 F 55 0 2017 112000.0 transaction 18 0 21.51
1 F 55 0 2017 112000.0 transaction 144 0 32.28
2 F 55 0 2017 112000.0 offer received 408 1 NaN
3 F 55 0 2017 112000.0 offer received 504 2 NaN
4 F 55 0 2017 112000.0 transaction 528 0 23.22
In [50]:
portfolio.head()
Out[50]:
reward difficulty duration offer_type offer_id email mobile social web
0 10 10 168 bogo 4 1 1 1 0
1 10 10 120 bogo 10 1 1 1 1
2 0 0 96 informational 2 1 1 0 1
3 5 5 168 bogo 1 1 1 0 1
4 5 20 240 discount 9 1 0 0 1
In [51]:
# df1 shows us the only offer's transactions
df1 = pd.merge(df, portfolio, on='offer_id')
df1 = df1.drop(columns={'amount'})
In [52]:
df1.head()
Out[52]:
gender age customer_id became_member_on income event time offer_id reward difficulty duration offer_type email mobile social web
0 F 55 0 2017 112000.0 offer received 408 1 5 5 168 bogo 1 1 0 1
1 F 55 0 2017 112000.0 offer completed 528 1 5 5 168 bogo 1 1 0 1
2 F 75 1 2017 100000.0 offer received 0 1 5 5 168 bogo 1 1 0 1
3 F 75 1 2017 100000.0 offer viewed 6 1 5 5 168 bogo 1 1 0 1
4 F 75 1 2017 100000.0 offer completed 132 1 5 5 168 bogo 1 1 0 1
In [53]:
df1.shape
Out[53]:
(148805, 16)
In [54]:
#check events
df1['event'].unique()
Out[54]:
array(['offer received', 'offer completed', 'offer viewed'], dtype=object)
In [55]:
# df1 shows us the only offer's transactions
df1.describe()
Out[55]:
age customer_id became_member_on income time offer_id reward difficulty duration email mobile social web
count 148805.000000 148805.000000 148805.000000 148805.000000 148805.000000 148805.000000 148805.000000 148805.000000 148805.000000 148805.0 148805.000000 148805.000000 148805.000000
mean 54.779282 7408.517308 2016.576365 66414.119149 354.570223 5.706703 4.442445 7.890561 159.004980 1.0 0.917160 0.658311 0.806747
std 17.193533 4278.063634 1.186534 21496.947967 198.311301 2.796607 3.372362 5.041335 51.192852 0.0 0.275641 0.474277 0.394851
min 18.000000 0.000000 2013.000000 30000.000000 0.000000 1.000000 0.000000 0.000000 72.000000 1.0 0.000000 0.000000 0.000000
25% 43.000000 3705.000000 2016.000000 51000.000000 168.000000 4.000000 2.000000 5.000000 120.000000 1.0 1.000000 0.000000 1.000000
50% 56.000000 7400.000000 2017.000000 65000.000000 408.000000 6.000000 5.000000 10.000000 168.000000 1.0 1.000000 1.000000 1.000000
75% 67.000000 11110.000000 2017.000000 81000.000000 510.000000 8.000000 5.000000 10.000000 168.000000 1.0 1.000000 1.000000 1.000000
max 101.000000 14824.000000 2018.000000 120000.000000 714.000000 10.000000 10.000000 20.000000 240.000000 1.0 1.000000 1.000000 1.000000
In [56]:
df2=df.drop(columns={'offer_id'})
In [57]:
# df3 shows us w/o any offers transactions so only transaction events
df3=df2.dropna(subset=['amount'])
In [58]:
df3.head()
Out[58]:
gender age customer_id became_member_on income event time amount
0 F 55 0 2017 112000.0 transaction 18 21.51
1 F 55 0 2017 112000.0 transaction 144 32.28
4 F 55 0 2017 112000.0 transaction 528 23.22
8 F 75 1 2017 100000.0 transaction 132 19.89
10 F 75 1 2017 100000.0 transaction 144 17.78
In [59]:
# df3 shows us w/o any offers transactions
df3.describe()
Out[59]:
age customer_id became_member_on income time amount
count 123957.000000 123957.000000 123957.000000 123957.000000 123957.000000 123957.000000
mean 52.713965 7416.438079 2016.314666 61843.510250 381.278201 13.996325
std 17.906667 4277.226631 1.241192 20661.890394 201.829085 31.749420
min 18.000000 0.000000 2013.000000 30000.000000 0.000000 0.050000
25% 39.000000 3721.000000 2016.000000 46000.000000 210.000000 3.660000
50% 54.000000 7424.000000 2017.000000 60000.000000 402.000000 10.800000
75% 65.000000 11105.000000 2017.000000 74000.000000 552.000000 19.130000
max 101.000000 14824.000000 2018.000000 120000.000000 714.000000 1062.280000
In [60]:
# df shows us whole transactions
df.describe()
Out[60]:
age customer_id became_member_on income time offer_id amount
count 272762.000000 272762.000000 272762.000000 272762.000000 272762.000000 272762.000000 123957.000000
mean 53.840696 7412.116911 2016.457435 64337.000755 366.707694 3.113286 13.996325
std 17.551337 4277.677254 1.218663 21243.762941 200.359087 3.512949 31.749420
min 18.000000 0.000000 2013.000000 30000.000000 0.000000 0.000000 0.050000
25% 41.000000 3712.000000 2016.000000 48000.000000 186.000000 0.000000 3.660000
50% 55.000000 7409.000000 2017.000000 62000.000000 408.000000 1.000000 10.800000
75% 66.000000 11109.000000 2017.000000 78000.000000 528.000000 6.000000 19.130000
max 101.000000 14824.000000 2018.000000 120000.000000 714.000000 10.000000 1062.280000

Additional data cleaning process to be added for modelling.¶

we need to convert some data to metrics¶

In [61]:
df1_model = df1.copy(deep=True)
In [62]:
df1_model["gender"][df1_model["gender"] == "M"] = 1
df1_model["gender"][df1_model["gender"] == "F"] = -1
df1_model["gender"][df1_model["gender"] == "O"] = 0
In [63]:
df1_model.head()
Out[63]:
gender age customer_id became_member_on income event time offer_id reward difficulty duration offer_type email mobile social web
0 -1 55 0 2017 112000.0 offer received 408 1 5 5 168 bogo 1 1 0 1
1 -1 55 0 2017 112000.0 offer completed 528 1 5 5 168 bogo 1 1 0 1
2 -1 75 1 2017 100000.0 offer received 0 1 5 5 168 bogo 1 1 0 1
3 -1 75 1 2017 100000.0 offer viewed 6 1 5 5 168 bogo 1 1 0 1
4 -1 75 1 2017 100000.0 offer completed 132 1 5 5 168 bogo 1 1 0 1
In [64]:
df1_model.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 148805 entries, 0 to 148804
Data columns (total 16 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   gender            148805 non-null  object 
 1   age               148805 non-null  int64  
 2   customer_id       148805 non-null  int64  
 3   became_member_on  148805 non-null  int64  
 4   income            148805 non-null  float64
 5   event             148805 non-null  object 
 6   time              148805 non-null  int64  
 7   offer_id          148805 non-null  int64  
 8   reward            148805 non-null  int64  
 9   difficulty        148805 non-null  int64  
 10  duration          148805 non-null  int64  
 11  offer_type        148805 non-null  object 
 12  email             148805 non-null  int64  
 13  mobile            148805 non-null  int64  
 14  social            148805 non-null  int64  
 15  web               148805 non-null  int64  
dtypes: float64(1), int64(12), object(3)
memory usage: 19.3+ MB
In [65]:
offer_type_dict = df1_model['offer_type'].unique()
In [66]:
offer_type_dict
Out[66]:
array(['bogo', 'informational', 'discount'], dtype=object)
In [67]:
df1_model["offer_type"][df1_model["offer_type"] == "bogo"] = 1
df1_model["offer_type"][df1_model["offer_type"] == "informational"] = -1
df1_model["offer_type"][df1_model["offer_type"] == "discount"] = 0
In [68]:
#encode 'event' data to numerical values
df1_model['event'] = df1_model['event'].map({'offer received':0, 'offer viewed':1, 'offer completed':2})
In [69]:
df1_model.head()
Out[69]:
gender age customer_id became_member_on income event time offer_id reward difficulty duration offer_type email mobile social web
0 -1 55 0 2017 112000.0 0 408 1 5 5 168 1 1 1 0 1
1 -1 55 0 2017 112000.0 2 528 1 5 5 168 1 1 1 0 1
2 -1 75 1 2017 100000.0 0 0 1 5 5 168 1 1 1 0 1
3 -1 75 1 2017 100000.0 1 6 1 5 5 168 1 1 1 0 1
4 -1 75 1 2017 100000.0 2 132 1 5 5 168 1 1 1 0 1
In [70]:
df1_model.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 148805 entries, 0 to 148804
Data columns (total 16 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   gender            148805 non-null  object 
 1   age               148805 non-null  int64  
 2   customer_id       148805 non-null  int64  
 3   became_member_on  148805 non-null  int64  
 4   income            148805 non-null  float64
 5   event             148805 non-null  int64  
 6   time              148805 non-null  int64  
 7   offer_id          148805 non-null  int64  
 8   reward            148805 non-null  int64  
 9   difficulty        148805 non-null  int64  
 10  duration          148805 non-null  int64  
 11  offer_type        148805 non-null  object 
 12  email             148805 non-null  int64  
 13  mobile            148805 non-null  int64  
 14  social            148805 non-null  int64  
 15  web               148805 non-null  int64  
dtypes: float64(1), int64(13), object(2)
memory usage: 19.3+ MB
In [71]:
df1_model = df1_model.drop(columns = ["customer_id","email","mobile","social","web"])
df1_model.head()
Out[71]:
gender age became_member_on income event time offer_id reward difficulty duration offer_type
0 -1 55 2017 112000.0 0 408 1 5 5 168 1
1 -1 55 2017 112000.0 2 528 1 5 5 168 1
2 -1 75 2017 100000.0 0 0 1 5 5 168 1
3 -1 75 2017 100000.0 1 6 1 5 5 168 1
4 -1 75 2017 100000.0 2 132 1 5 5 168 1
In [72]:
df1.head()
Out[72]:
gender age customer_id became_member_on income event time offer_id reward difficulty duration offer_type email mobile social web
0 F 55 0 2017 112000.0 offer received 408 1 5 5 168 bogo 1 1 0 1
1 F 55 0 2017 112000.0 offer completed 528 1 5 5 168 bogo 1 1 0 1
2 F 75 1 2017 100000.0 offer received 0 1 5 5 168 bogo 1 1 0 1
3 F 75 1 2017 100000.0 offer viewed 6 1 5 5 168 bogo 1 1 0 1
4 F 75 1 2017 100000.0 offer completed 132 1 5 5 168 bogo 1 1 0 1

Now let's check the chrachterestics similarity between df1 and df3¶

In [73]:
plt.ylabel("Frequency")
plt.xlabel("Age")
plt.hist(df1['age'], color = "m", bins=24);
In [74]:
plt.ylabel("Frequency")
plt.xlabel("Age")
plt.hist(df3['age'], color = "m", bins=24);

We can see that age distribution is same for df1 and df3 datasets.¶

In [75]:
plt.ylabel("Frequency")
plt.xlabel("Become member in which year")
plt.hist(df1['became_member_on'], color = "c", bins=12);
In [76]:
plt.ylabel("Frequency")
plt.xlabel("Become member in which year")
plt.hist(df3['became_member_on'], color = "c", bins=12);

We can see that year distribution is same for df1 and df3 datasets.¶

In [77]:
plt.ylabel("Frequency")
plt.xlabel("Income")
plt.hist(df1['income'], color = "g", bins=24);
In [78]:
plt.ylabel("Frequency")
plt.xlabel("Income")
plt.hist(df3['income'], color = "g", bins=24);

We can notice that income distribution are quit different, specially higher than 75k, it mean that Starbucks has more target for customers who have income between 75k and 100k.¶

In [79]:
plt.ylabel("Frequency")
plt.xlabel("Offers time")
plt.hist(df1['time'], color = "c", bins=24);
In [80]:
plt.ylabel("Frequency")
plt.xlabel("Offers completed")
plt.hist(df1[df1['event']=="offer completed"]['time'], color = "c", bins=24);

After checking both of above figures, we can notice that Starbucks are not sending offers everyday, it usually decrease after second offer which meand, offers are going to decrease when customers not continuing accepting offers¶

In [81]:
plt.ylabel("Frequency")
plt.xlabel("Offers recieved in time")
plt.hist(df1[df1['event']=="offer received"]['time'], color = "b", bins=24);
In [82]:
plt.ylabel("Frequency")
plt.xlabel("Offers viewed in time")
plt.hist(df1[df1['event']=="offer viewed"]['time'], color = "b", bins=24);
In [83]:
plt.ylabel("Frequency")
plt.xlabel("Time")
plt.hist(df3['time'], color = "b", bins=24);

We can notice that transactions w/o offers has increased with the increased frequency of offers¶

In [84]:
px.histogram(df1, x="offer_type", color="gender",barmode='group')
In [85]:
px.histogram(df3, x="gender", color="gender",barmode='group')

As we can notice that; men get more offers.¶

In [86]:
px.histogram(df1, x="income", color="gender")

We can notice that women has higher income than men.¶

In [87]:
px.histogram(df1, x="time", color="event")

We can notice that increasing in offers viewed increase the offer complete as well.¶

In [88]:
px.histogram(df1, x="offer_type", color="event",barmode='group', title="Offers vs Event Distributions")

We can notice that discount and bogo has almost same offer recieved, but bogo has higher offer viewes, overall discount has more slicely offer completed¶

In [89]:
px.histogram(df1, x="offer_id", color="event",barmode='group', title="Detailed Offers vs Event Distributions")
In [90]:
px.histogram(df1, x="time", color="offer_id")

We can notice that the diificulity of offer affect the completion ratio¶

In [91]:
px.histogram(df1, x="event", color="gender", barmode='group')

We can notice that the ratio of completing the offers for women are higher than men.¶

In [92]:
px.histogram(df1[df["gender"] == "M"], x="offer_id", color="event", histfunc="count",barmode='group')
In [93]:
px.histogram(df1[df["gender"] == "F"], x="offer_id", color="event", histfunc="count",barmode='group')

We can notice that men tends to complete offers if it has high difficulity more than women, also men and women sharing the same first 3 difficulities¶

In [94]:
px.histogram(df1, x="age", color="offer_id")

We can notice that women are older than men as women are not sensitive to offer difficulties during age, also men prefer low diffidulities for men under age of 70, and higher can prefer high difficulties once¶

In [95]:
px.histogram(df1, x="offer_type", color="became_member_on", barmode='group')
In [96]:
px.histogram(df1, x="event", color="became_member_on", barmode='group')

Starbucks having same strategy for whole users, we can notice increasing in completing offer for the first 3 years, but in 2016 and 2018 completing offers are getting decreased, offers should be more clear for the old customers to retain their completing new offers¶

In [97]:
px.histogram(df1, x="income", color="offer_id")

men who have 35K to 50K income have different choices on different offers. women who have 65K to 85K income have different choices on different offers.¶

Understanding of data to see the difference between transactions and offer resulting.¶

When we examine offers sent and transactions data independently, we find that their demographics are almost identical.

Age and year of membership distributions are substantially identical across both datasets.

For 2 datasets, income is considerably different, especially when it is more than $75,000. In other words, 75 to 100K clients with higher incomes received more offers from Starbucks as a proportion. This group of clients was their aim.

Offers weren't delivered every day. Following the second offer, the interval was lowered. On the other side, after day 0 of continual offer sending, offer receipts would decline.

Men receive more offers than women do in terms of quantity.

Distribution data indicate that women earn more than males.

Findings¶

1) Income distribution are quit different, specially higher than 75k, it mean that Starbucks has more target for customers who have income between 75k and 100k.

2) Starbucks are not sending offers everyday, it usually decrease after second offer which meand, offers are going to decrease when customers not continuing accepting offers

3) Transactions w/o offers has increased with the increased frequency of offers

4) Men get more offers.

5) The ratio of completing the offers for women are higher than men.

6) Discount and bogo has almost same offer recieved, but bogo has higher offer viewes, overall discount has more slicely offer completed

7) Men tends to complete offers if it has high difficulity more than women, also

8) Men and women sharing the same first 3 difficulities

9) Women are not sensitive to offer difficulties during age, also

10) Men prefer low diffidulities for men under age of 70, and higher can prefer high difficulties once

11) Starbucks having same strategy for whole users, we can notice increasing in completing offer for the first 3 years

12) In 2016 and 2018 completing offers are getting decreased, offers should be more clear for the old customers to retain their completing new offers.

13) Men who have 35K to 50K income have different choices on different offers.

14) Women who have 65K to 85K income have different choices on different offers.

Summary of visualization outcomes¶

Men customers between the ages of 20 and 45 responded to offers 7 and 8 more favorably than other offers.

Men customers between the ages of 45 and 60 responded to alternative offers instead of 7 and 8.

Men customers with incomes between $35k and $50k responded to different offers instead of 4,6,9.

——-

Women tend to devalue offers but are less sensitive to the types of offers. Expand the discounts available to women.

Women customers with incomes between $65,000 and $85,000 responded to offers 1 and 10.

———-

Concentrate primarily on offers 7,8,5, and 1. Higher retention rate.

In general, the retention rate is higher for third- and fourth-year members.

In [98]:
sns.heatmap(df1_model.corr(method="kendall"),annot=True, fmt="0.2f");
In [99]:
df1_model.head()
Out[99]:
gender age became_member_on income event time offer_id reward difficulty duration offer_type
0 -1 55 2017 112000.0 0 408 1 5 5 168 1
1 -1 55 2017 112000.0 2 528 1 5 5 168 1
2 -1 75 2017 100000.0 0 0 1 5 5 168 1
3 -1 75 2017 100000.0 1 6 1 5 5 168 1
4 -1 75 2017 100000.0 2 132 1 5 5 168 1
In [100]:
df1_model_Q2 = df1_model[df1_model['gender']==1].reset_index(drop=True)
In [101]:
sns.heatmap(df1_model_Q2.corr(method="kendall"), annot=True, fmt="0.2f");
In [102]:
df1_model_y = df1_model.copy(deep=True)
In [103]:
X = df1_model.drop(columns=["event"])
y = df1_model.drop(columns = ["age","became_member_on","gender","income","time",
                              "offer_id","difficulty","duration","offer_type","reward"])
In [104]:
y.head()
Out[104]:
event
0 0
1 2
2 0
3 1
4 2
In [105]:
scaler = MinMaxScaler()

def testandtrain(X,y,category=False, channel =False):

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
    X_train.shape, X_test.shape, y_train.shape, y_test.shape  
    
    X_train.age = scaler.fit_transform(X_train.age.values.reshape(-1, 1))
    X_test.age = scaler.fit_transform(X_test.age.values.reshape(-1, 1))

    X_train.became_member_on = scaler.fit_transform(X_train.became_member_on.values.reshape(-1, 1))
    X_test.became_member_on = scaler.fit_transform(X_test.became_member_on.values.reshape(-1, 1))

    X_train.income = scaler.fit_transform(X_train.income.values.reshape(-1, 1))
    X_test.income = scaler.fit_transform(X_test.income.values.reshape(-1, 1))
    
    X_train.time = scaler.fit_transform(X_train.time.values.reshape(-1, 1))
    X_test.time = scaler.fit_transform(X_test.time.values.reshape(-1, 1))
        
    X_train.offer_id = scaler.fit_transform(X_train.offer_id.values.reshape(-1, 1))
    X_test.offer_id = scaler.fit_transform(X_test.offer_id.values.reshape(-1, 1))
        
    X_train.difficulty = scaler.fit_transform(X_train.difficulty.values.reshape(-1, 1))
    X_test.difficulty = scaler.fit_transform(X_test.difficulty.values.reshape(-1, 1))
                
    X_train.duration = scaler.fit_transform(X_train.duration.values.reshape(-1, 1))
    X_test.duration = scaler.fit_transform(X_test.duration.values.reshape(-1, 1))
  
    X_train.reward = scaler.fit_transform(X_train.reward.values.reshape(-1, 1))
    X_test.reward = scaler.fit_transform(X_test.reward.values.reshape(-1, 1))
  
    X_train.gender = scaler.fit_transform(X_train.gender.values.reshape(-1, 1))
    X_test.gender = scaler.fit_transform(X_test.gender.values.reshape(-1, 1))

    X_train.offer_type = scaler.fit_transform(X_train.offer_type.values.reshape(-1, 1))
    X_test.offer_type = scaler.fit_transform(X_test.offer_type.values.reshape(-1, 1))

  
    return X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = testandtrain(X,y)
In [106]:
X_train.shape
Out[106]:
(74402, 10)
In [107]:
y_train.shape
Out[107]:
(74402, 1)
In [108]:
X_test.shape
Out[108]:
(74403, 10)
In [109]:
y_test.shape
Out[109]:
(74403, 1)
In [110]:
X_train.describe()
Out[110]:
gender age became_member_on income time offer_id reward difficulty duration offer_type
count 74402.000000 74402.000000 74402.000000 74402.000000 74402.000000 74402.000000 74402.000000 74402.000000 74402.000000 74402.000000
mean 0.564387 0.442281 0.715704 0.404558 0.497223 0.522162 0.444792 0.393850 0.517254 0.639553
std 0.491918 0.206864 0.237453 0.238705 0.277118 0.311485 0.337052 0.251632 0.304082 0.355390
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000000 0.301205 0.600000 0.233333 0.243697 0.222222 0.200000 0.250000 0.285714 0.500000
50% 1.000000 0.445783 0.800000 0.388889 0.571429 0.555556 0.500000 0.500000 0.571429 0.500000
75% 1.000000 0.578313 0.800000 0.566667 0.714286 0.777778 0.500000 0.500000 0.571429 1.000000
max 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
In [111]:
sns.heatmap(X_test.corr(method="kendall"), annot=True, fmt="0.2f");
In [112]:
def train_test(model):
   
    train_prediction = (model.fit(X_train, y_train)).predict(X_train)
    test_predictions = (model.fit(X_train, y_train)).predict(X_test)
    train_f1_score = accuracy_score(y_train, train_prediction) * 100
    test_f1_score = fbeta_score(y_test, test_predictions, beta=0.5, average='micro') * 100
    
    return train_f1_score, test_f1_score, test_predictions
In [113]:
dtc = DecisionTreeClassifier(random_state = 1)
dtc_train_f1_score, dtc_test_f1_score,dtc_test_predictions = train_test(dtc)
In [114]:
models = {'Model': ['DecisionTreeClassifier'], 
          'Train F1 score ':[dtc_train_f1_score], 
          'Test F1 score': [dtc_test_f1_score] }
          
results = pd.DataFrame(models)
results
Out[114]:
Model Train F1 score Test F1 score
0 DecisionTreeClassifier 94.631865 63.713829
In [115]:
X_test.head()
Out[115]:
gender age became_member_on income time offer_id reward difficulty duration offer_type
93490 1.0 0.216867 1.0 0.200000 0.294118 0.666667 0.2 0.50 1.000000 0.5
79279 1.0 0.277108 0.0 0.544444 0.705882 0.555556 0.2 0.50 0.571429 0.5
5980 1.0 0.518072 0.2 0.233333 0.638655 0.000000 0.5 0.25 0.571429 1.0
124850 0.0 0.277108 0.6 0.233333 0.705882 0.888889 0.5 1.00 1.000000 0.5
46034 1.0 0.590361 0.6 0.655556 0.470588 0.333333 1.0 0.50 0.571429 1.0
In [116]:
y_test.head()
Out[116]:
event
93490 2
79279 0
5980 1
124850 0
46034 0
In [117]:
dtc_test_predictions
Out[117]:
array([2, 0, 1, ..., 2, 0, 0], dtype=int64)
In [118]:
pred_out=pd.DataFrame(dtc_test_predictions, columns=['predicted_outcome'])
pred_out.head()
Out[118]:
predicted_outcome
0 2
1 0
2 1
3 0
4 1
In [ ]: